Skip to content

[hiptensor] Review and update all ACTOR_CRITIC kernel UIDs for contractions#7897

Open
evedovelli wants to merge 5 commits into
developfrom
users/evedovel/hiptensor/fix-contraction-uids
Open

[hiptensor] Review and update all ACTOR_CRITIC kernel UIDs for contractions#7897
evedovelli wants to merge 5 commits into
developfrom
users/evedovel/hiptensor/fix-contraction-uids

Conversation

@evedovelli
Copy link
Copy Markdown
Contributor

Summary

HIPTENSOR_ALGO_ACTOR_CRITIC contraction tests were failing with HIPTENSOR_STATUS_EXECUTION_FAILED on all data types and ranks after a Composable Kernel update. The root cause was a CK commit
(ROCm/composable_kernel@e58096d, "add composable kernel support on gfx1250") that changed all F16/BF16/F32 contraction instances from KPerBlock=16 to KPerBlock=32. The
hard-coded unique_id kernel name strings in ActorCritic[UnaryOps]Selection in contraction_selection.cpp still referenced KPerBlock=16, so findByKernelName returned nullptr for every actor-critic lookup.

This change fixes these invalid kernel selections and add a set the scripts to aid on probing and patching new kernel UIDs in the future.

Changes

library/src/contraction/contraction_selection.cpp

  • Updated all unique_id strings across all 40 ActorCritic[UnaryOps]Selection specializations (22 regular + 18 unary-ops) to match the kernel names compiled by the current CK version. Winners were selected by running
    HIPTENSOR_ALGO_DEFAULT_PATIENT brute-force timing on gfx950 (F16/BF16/F32/F64) and gfx1201 (data/compute type combinations not available on gfx950), then taking the best-performing kernel per (type, layout, rank) slot.
  • Removed 9 stale rank == 2 && dim1 special-case branches from the row-major path of all BILINEAR specializations. These branches selected a KPerBlock=16 kernel for rank-2 problems with at least one unit-length dimension, which no
    longer exists in CK.

scripts/update_actor_critic/ (new)

A three-script toolset to re-derive and apply correct actor-critic kernel names after any future CK kernel parameter changes, without needing to manually inspect kernel name strings.

  • _actor_critic_combos.py — shared definitions for all 40 specializations: type combinations, YAML probe config generation (with correct mode numbering and memory-layout–appropriate problem sizes per rank), and BRUTE_FORCE_KERNEL_PERF
    log parsing.
  • probe_actor_critic.py — runs HIPTENSOR_ALGO_DEFAULT_PATIENT for every (type, rank, layout) combination on a single GPU, saves winning kernel names and timing to JSON. Intended to be run once per GPU ASIC.
  • patch_actor_critic.py — merges results from one or more probe JSONs (first-file-wins on conflict) and patches contraction_selection.cpp in-place. Supports --dry-run and --show-conflicts.

Test Plan and Results

  • All HIPTENSOR_ALGO_ACTOR_CRITIC contraction tests pass for F16, BF16, F32, F64, CF32, CF64 on:
    • gfx90a
    • gfx942
    • gfx950
    • gfx1101
    • gfx1201
  • All HIPTENSOR_ALGO_ACTOR_CRITIC contraction tests pass for F16, BF16, F32 on:
    • gfx1101
    • gfx1201
  • All HIPTENSOR_ALGO_DEFAULT_PATIENT contraction tests continue to pass (no regression)
  • probe_actor_critic.py produces valid JSON when run against a built test directory
  • patch_actor_critic.py --dry-run reports expected changes without writing

Submission Checklist

…ctions

All former kernels with KPerBlock7=16 became invalid in Composable Kernel.
This change adds scripts to run all required variations contractions and
probe for the most efficient kernels, and update the code with the obtained
results.

The selected kernels were the best performing kernels on gfx950 and on
gfx1201 (for date and compute combin ations not available on gfx950).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes hipTensor ACTOR_CRITIC contraction kernel selection failures after a Composable Kernel update by updating hard-coded actor-critic kernel UID strings to match currently compiled CK kernel names, and adds scripts to re-probe/patch these UIDs for future CK changes.

Changes:

  • Updated ActorCritic(Selection|SelectionUnaryOps) unique_id kernel-name strings in contraction_selection.cpp to match the current CK contraction instance names.
  • Removed stale rank == 2 && dim1 special-case selection branches in row-major bilinear specializations (no longer valid with current CK instances).
  • Added a small probe/patch script toolset to regenerate and apply winning kernel UIDs from DEFAULT_PATIENT brute-force runs.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
projects/hiptensor/scripts/update_actor_critic/probe_actor_critic.py New probe script to brute-force DEFAULT_PATIENT winners and write per-(type,layout,rank) kernel-name results to JSON.
projects/hiptensor/scripts/update_actor_critic/patch_actor_critic.py New patch script to merge probe JSONs and update unique_id strings in contraction_selection.cpp.
projects/hiptensor/scripts/update_actor_critic/_actor_critic_combos.py Shared definitions/helpers for probe YAML generation and parsing brute-force perf logs.
projects/hiptensor/library/src/contraction/contraction_selection.cpp Updates ACTOR_CRITIC kernel UID strings and removes obsolete rank2/dim1 row-major bilinear branches.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/hiptensor/scripts/update_actor_critic/probe_actor_critic.py Outdated
Comment thread projects/hiptensor/scripts/update_actor_critic/patch_actor_critic.py Outdated
evedovelli and others added 3 commits May 29, 2026 15:53
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@evedovelli evedovelli requested a review from a team as a code owner May 29, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants